Abstract


An ontology is an effective formal representation of knowledge used commonly in artificial
intelligence, semantic web, software engineering, and information retrieval. In open
and distance learning, ontologies are used as knowledge bases for e-learning supplements,
educational recommenders, and question answering systems that support students with
much needed resources. In such systems, ontology construction is one of the most important
phases. Since there are abundant documents on the Internet, useful learning materials
can be acquired openly with the use of an ontology. However, due to the lack of system
support for ontology construction, it is difficult to construct self-instructional materials for
Vietnamese people. In general, the cost of manual acquisition of ontologies from domain
documents and expert knowledge is too high. Therefore, we present a support system for
Vietnamese ontology construction using pattern-based mechanisms to discover Vietnamese
concepts and conceptual relations from Vietnamese text documents. In this system,
we use the combination of statistics-based, data mining, and Vietnamese natural language
processing methods to develop concept and conceptual relation extraction algorithms to
discover knowledge from Vietnamese text documents. From the experiments, we show that
our approach provides a feasible solution to build Vietnamese ontologies used for supporting
systems in education.

Keywords: Ontology; concept discovery; conceptual relation; text mining; lexical pattern; 
natural language processing 







A Semi-Automatic Approach to Construct Vietnamese Ontology from Online Text 


Nguyen and Yang 


Introduction 


An ontology is a formal, explicit specification of a shared conceptualization (Noy &
McGuinness, 2001). Ontologies that belong to a specific domain are constructed from
knowledge about domain concepts, their properties and instances, and the conceptual relations
between them. In recent years, many semantic-based intelligent systems such as searching
systems, recommender systems, and question answering systems have used ontologies as
their knowledge bases. In education and e-learning, many researchers have built learning
support systems that take advantage of ontologies. Li and Rui (2005) proposed a novel way
to organize learning content into small “atomic” units called learning objects and systemized
them together with their ontology into knowledge bases used for a recommendation
mechanism. Ana et al. (2009) built a recommender system in which a domain ontological
model is presented as support to Venezuelan students’ decision making for study
opportunities. Saman et al. (2012) developed a knowledge-based and personalized e-learning
recommendation system based on ontology to improve the quality of an e-learning system.
These ontologies were constructed manually by using expert knowledge obtained from
many resources and documents.

Due to their availability and abundance, text documents are one of the most popular types
of knowledge sources for experts to construct their domain ontologies. Many research studies
have been done on text mining and ontology construction using concept/entity extraction
and conceptual relation discovery. Text mining is a subfield of data mining that
discovers useful and hidden patterns or information in text. It has been used widely
in many fields such as information retrieval, linguistics, knowledge engineering, and
bioinformatics. Among text mining tasks, concept/entity extraction (concept mining) is applied
extensively in many applications such as document summarization, question answering
systems, taxonomy construction, and ontology construction. Most concept mining methods
are based on linguistic rules, statistics, or a combination of both (Zhou & Wang, 2010).
Other research studies also use frequent pattern mining and association rule mining for
discovering concepts and conceptual relations from text (Maedche & Staab, 2000, 2001;
Zhou & Wang, 2010; Chen, Zhang, Li, & Li, 2005).

Ontology construction requires efforts to uncover and organize relevant domain knowledge 
in a suitable structure according to the purpose of the ontology’s usage. This can be done 
manually or by using automatic or semi-automatic methods, in which learning methods 
and knowledge engineering are applied to extract concepts and conceptual relations from 
domain documents. 

In manual construction approaches, domain experts play an essential role. Many tasks are
done by these experts: covering domain terms (concepts), defining classes and class
hierarchies, creating class slots (properties), filling slot values, and generating instances (Noy &
McGuinness, 2001). Since every task is executed and verified by humans, the constructed
ontologies tend to be accurate, reasonable, and adequate in context. However,
this approach requires a large amount of human effort and time, especially for large-scale domains
such as the semantic web.
By contrast, fully automatic ontology construction methods try to learn and extract knowledge
from domain documents without human supervision. For instance, Blaschke and
Valencia (2002) introduced an automatic ontology construction method using bibliographic
information. Maedche and Staab (2001) presented an ontology learning framework
for the semantic web through ontology import, extraction, pruning, refinement, and
evaluation mechanisms. Lee et al. (2007) presented an episode-based
ontology construction mechanism from text documents and used a fuzzy inference
mechanism for Chinese text ontology learning. Unfortunately, these methods are usually
difficult to implement and limited to specific domains since many domain-specific decisions
must be made to adequately specify the domain of interest (Jaimes & Smith, 2003).
In addition, learning the knowledge base from unstructured data is cognitive work that
needs many supporting studies, and the concept hierarchy acquisition is one of the largest
challenges.

Summarizing the above approaches, a semi-automatic ontology construction method is the
most common approach, in which information extraction techniques are used under the
supervision of humans. Such methods include learning modules to extract concepts and
conceptual relations from domain documents. They require expert knowledge to verify the
obtained information and decide which information should be included in the ontology.
In English, many frameworks and plugins have been built to help users construct ontologies
semi-automatically. For example, TextToOnto, proposed by Maedche and Staab (2000),
used generalized association rules to find the co-occurrences between items and the relations
between them. OntoLT is a Protégé plugin that extracts concepts and relations from
annotated documents for ontology construction.

Typically, taxonomy is needed in ontology acquisition tasks to construct the concept
hierarchical structure of the ontology. In English, taxonomy-based approaches often use
WordNet as a super taxonomy to determine the conceptual relations between concepts. In
Chinese, HowNet has been used with the same role. When taxonomies are not available (or for
other reasons), a nontaxonomy approach is considered using learning algorithms (e.g., Lee,
Kao, Kuo, & Wang, 2007; Maedche & Staab, 2001; Blaschke & Valencia, 2002).

To extract candidate terms, the well-known statistical measurement TF-IDF can be used
(Lee, Kao, Kuo, & Wang, 2007; Zheng, Dou, Wu, & Li, 2007). Association rules or frequent
patterns are mined to discover the co-occurrences and semantic relations between terms
(Maedche & Staab, 2000, 2001; Zheng, Dou, Wu, & Li, 2007). Linguistic rules were also
used in research (e.g., Zhou & Wang, 2010; Chen, Zhang, Li, & Li, 2005) in which
predefined lexical patterns were used to extract candidate concepts by a bootstrapping
mechanism. In Vietnamese, Nguyen and Phan (2009) proposed a hybrid approach which
combines lexical rule-based and ontology-based methods to extract key terms and phrases from
Vietnamese text.

In this research, we propose a semi-automatic approach to extract concepts and conceptual
relations from Vietnamese text documents by using a combination of text mining
techniques and statistics-based methods. Concepts will be discovered not only based on the
TF-IDF measure but also by applying lexical patterns and association rule mining. The reason
for using a combination of various techniques is shown in Table 1. We also aim to compare
the performance of various concept discovery algorithms and the combination of them to
determine the best extraction approach for Vietnamese text documents.

Table 1

Comparison of Used Techniques

Techniques                        Based on                                   Weaknesses

Statistics-based                  Importance of terms (TF-IDF)               Easily affected by noise

Association rules                 Co-occurrences of terms in documents       Does not consider the semantic
                                                                             aspect of documents

Lexical rule-based                Predefined linguistic rules                Hard to build a complete rule set
                                                                             covering all language cases

Combination of statistics-based   Takes into account both statistical and
and lexical rule-based            linguistic characteristics of terms



Proposed Method 


In this section, we present our proposed system, called Vietnamese Text To Ontology, or
ViText2Onto, along with learning techniques to discover concepts and conceptual relations
from Vietnamese text documents. Our contribution can be stated as follows: Given a set
of Vietnamese text documents in a specific domain, our system can support the user to
construct an ontology using a semi-automatic approach. The resulting ontology contains
concepts and instances organized in an appropriate hierarchy.

System Architecture 

To construct an ontology, domain knowledge must be discovered and organized in a
conceptual hierarchical structure. ViText2Onto employs a semi-automatic approach where
discovery methods are used in combination with human supervision. From this perspective,
an interactive mechanism is established between the system and users where the
construction process is iterative and cyclic. After each iteration, the conceptual hierarchy is
extended and verified by users such that the users can incrementally discover more concepts
and relations based on the assessed concepts. The system architecture is shown in Figure 1,
which includes the following components.

Figure 1. System architecture.

(1) A Vietnamese natural language processing module is used to make the Vietnamese text
documents ready for extraction algorithms. It is a set of Vietnamese processing tools that
perform tokenizing, part of speech tagging (POS tagging), and chunking. The output of this
module is a set of annotated documents stored in text files. A small converter reads
these files and converts them into a compatible format that can be used by GATE.

(2) A learning and discovering component is used for extracting concepts and conceptual
relations from annotated documents. We use various learning and discovering algorithms,
including pattern-based, statistics-based, and association-based approaches. To implement
pattern-based learning, we use JAPE (Java Annotation Pattern Engine), which is an
element of the GATE framework. JAPE provides finite state transduction over annotations
that lets us extract predefined patterns based on rules written in a specific grammar.

(3) Lexical patterns contain lexical rules written using JAPE grammar which are used for
pattern-based learning. They are constructed based on Vietnamese syntactic rules. Applying
these rules on the corpus using JAPE, we can extract concepts and conceptual relations
from the matched patterns.

Vietnamese Language Processing 

We use Vietnamese language processing tools provided by the project of Building Basic
Resources and Tools for Vietnamese Language and Speech Processing (VLSP) for preprocessing
Vietnamese textual corpora. The processing components have the following features.

Vietnamese word tokenization. 

Due to the characteristics of Vietnamese, a word might contain only one individual word
(one morpheme) or a compound of two or more individual words (many morphemes). This
tool identifies words and tokenizes sentences into separate tokens. Resulting tokenized
documents are used for further analysis tasks.

Vietnamese part of speech tagging (POS tagging). 

As discovered concepts are mostly nouns, proper nouns, and noun phrases, POS tags play 
an essential role in both syntactic- and semantic-based learning for ontology acquisition. 
The POS tagger uses tokenized documents as input and assigns a POS label for all tokens. 

Vietnamese chunking. 

Chunking is used to divide each sentence into frames containing one or more words where
each frame has a specific grammatical role in the sentence. Segmenting sentences into
chunks helps determine grammatical roles of elements in sentences; hence, it is useful for
learning and extracting. In our extraction rule sets, noun phrases and verb phrases are used
as the main units of the patterns. Chunking frames are also used in association rule mining
where phrases are used as input.

Stop word removal.

Many words occur with high frequency in Vietnamese text while
contributing very little to the subject of sentences. To avoid noise caused by these words, we
apply stop word removal when computing the TF-IDF of terms.

Learning Algorithms 

In this research, the purpose of the ontology learning task is to discover concepts and
conceptual relations. We use a combination of lexical pattern-based, frequent sequence-based,
and statistics-based methods to overcome some drawbacks in each of the individual
methods. Figure 2 shows the model of learning and discovery components.

Figure 2. Learning model.

Overall Construction Process 

The overall ontology construction process in the proposed system is illustrated in Figure 3.
Initially, a user prepares an input corpus. Then the text files are put into the Vietnamese
natural language processing module for tokenizing, POS tagging, and chunking. Processed
documents are converted into the specific document format of GATE using our own
converter. These Vietnamese text documents are ready for the learning process.

Firstly, candidate concepts are extracted and presented in our user interface. Note that
users can specify the TF-IDF threshold and minimum support for the extraction algorithm.
Concepts will be sorted in descending order by the TF-IDF score to help users select
important concepts. At this step, users may only have a small number of concepts to select as seed
concepts. Then a prefix-based discovery algorithm will be run to generate concept trees.
Again, users will select relevant concepts as input to the relation extraction phase. In this
phase, pattern-based and association rule-based relation extraction methods are executed.
Matched patterns and association rules satisfying the minimum support and confidence are
sent to the user interface in the form of a relation between a pair of concepts.

Figure 3. Overall construction process.

Users will make the final decision to select relevant concepts and conceptual relations. The
selected result is exported to the ontology in OWL (Web Ontology Language) by using the
Jena toolkit.

In this process, expert knowledge is embedded into the ontology in two steps where
concepts and relations are presented to users for selection. The resulting ontology contains all
selected concepts and relations that can be edited easily using some ontology editing tools
such as Protégé to meet user expectations.

Concept Discovery 


TF-IDF-based candidate term selection. 


The well-known term weight TF-IDF is used to measure the importance of individual terms
contributing to documents. Important terms, that is, terms having higher TF-IDF scores,
will be selected based on a user-defined threshold. The TF-IDF of a term T_i in a document d_j is
computed by the following equation:

$$\mathrm{tfidf}(T_i, d_j) = \mathrm{tf}(T_i, d_j) \times \log\frac{|D|}{|\{d : T_i \in d\}|}$$

where $\mathrm{tf}(T_i, d_j) = \frac{n_{ij}}{\sum_k n_{kj}}$ is the term frequency of a term T_i in a
document d_j, n_ij is the number of occurrences of T_i in d_j, the denominator is the size of d_j,
|D| is the number of documents in the corpus, and
$\log\frac{|D|}{|\{d : T_i \in d\}|}$ is the inverse document frequency of T_i.

We construct a set S_TFIDF of candidate terms whose TF-IDF values exceed the threshold δ:

$$S_{TFIDF} = \{\, t_i \mid \mathrm{TFIDF}(t_i) > \delta \,\}$$

where t_i refers to one term in the documents.
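The TF-IDF selection step can be sketched in a few lines of Python. This is an illustrative sketch, not the system's implementation: the function names are hypothetical, documents are assumed to be already tokenized into lists of terms, and, because the selection rule above does not say how a term's score is aggregated over documents, the sketch assumes the maximum score over the corpus.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Return {(term, doc_index): tf-idf} for tokenized documents (lists of terms)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = {}
    for j, doc in enumerate(documents):
        counts = Counter(doc)
        size = len(doc)  # sum over k of n_kj
        for term, n_ij in counts.items():
            tf = n_ij / size
            idf = math.log(n_docs / doc_freq[term])
            scores[(term, j)] = tf * idf
    return scores

def candidate_terms(documents, delta):
    """S_TFIDF: terms whose best score over the corpus exceeds delta (assumed aggregation)."""
    best = {}
    for (term, _), score in tfidf_scores(documents).items():
        best[term] = max(best.get(term, 0.0), score)
    return {term for term, score in best.items() if score > delta}

docs = [["màn_hình", "cảm_ứng", "và"], ["bàn_phím", "và"], ["pin", "và"]]
selected = candidate_terms(docs, delta=0.0)
```

A term that occurs in every document, such as "và" here, gets an inverse document frequency of zero and is therefore never selected for a non-negative threshold.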

Lexical pattern-based candidate term and phrase selection. 

TF-IDF can be used to select important individual terms. However, a Vietnamese concept
often consists of multiple terms. To discover multiple-term concepts, a lexical pattern-based
approach is used. We built a set of lexical rules based on JAPE grammar to discover
proper nouns and noun phrases. Input sentences are processed by the finite state
transducer provided by JAPE, in which matched patterns will be discovered and annotated.

According to Vietnamese grammatical characteristics, the following patterns can be used 
for noun phrase and proper noun identification: 

Noun+ Noun 

Noun* (Noun | ProperNoun) (Adj | Noun)* 

Noun+ Verb (Adj | Noun)+ 

Noun* ProperNoun+ Number 

where “|” means or, “+” means one or more occurrences, and “*” means zero or more
occurrences. The last pattern listed above is used for identifying proper nouns which end with
a number, such as Windows Mobile 6.0 and iPhone 4.

Here is an example of a noun phrase: [Công ty]N [trách nhiệm]N [hữu hạn]Adj [VinaCom]NP
(meaning [Limited]Adj [Company]N [VinaCom]NP).

Applying these patterns to the input documents using GATE, we obtain a list of candidate
noun phrases and proper nouns S_P.
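The four patterns above can be illustrated with regular expressions over a string of POS codes. This is only a stand-in for the JAPE rules, which are not reproduced in the text: the one-letter tag codes, the tag names, and the function name are assumptions made for this sketch.

```python
import re

# One-letter codes for POS tags (an assumption for this sketch).
TAG_CODES = {"Noun": "N", "ProperNoun": "P", "Adj": "A", "Verb": "V", "Number": "M"}

PATTERNS = [
    re.compile(r"N+N"),          # Noun+ Noun
    re.compile(r"N*[NP][AN]*"),  # Noun* (Noun | ProperNoun) (Adj | Noun)*
    re.compile(r"N+V[AN]+"),     # Noun+ Verb (Adj | Noun)+
    re.compile(r"N*P+M"),        # Noun* ProperNoun+ Number
]

def extract_candidates(tagged_tokens):
    """Return the set of phrases whose tag sequence matches any pattern."""
    tags = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    found = set()
    for pattern in PATTERNS:
        # Each match span over the tag string maps directly to a token span.
        for match in pattern.finditer(tags):
            found.add(" ".join(words[match.start():match.end()]))
    return found

sentence = [("điện_thoại", "Noun"), ("iPhone", "ProperNoun"), ("4", "Number")]
candidates = extract_candidates(sentence)
```

Because every pattern is applied independently, overlapping candidates are kept; in this sketch the later filtering steps, not the matcher, decide which of them become concepts.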

Sequential pattern mining. 

Lexical pattern-based learning is practical and appropriate for deep knowledge discovery,
but the quality of its results depends strongly on the completeness of the set of lexical
rules. To overcome this weakness, we adopt frequent sequence mining in
natural language. Based on the assumption that a concept might be a phrase or a part of a
phrase in which element words usually appear in fixed orders, we consider a concept as a
sequence of ordered words. Concepts might be obtained by mining frequent sequential
patterns from the documents where each noun phrase is considered as a transaction.

In our research, we use segmented sentences with chunking labels as input for sequential
pattern mining. As the input sentences are segmented into frames which have specific
grammatical roles, a concept often belongs to only one frame. We consider each frame as
a sequence and each word as an item. By mining frequent sequences we can obtain word
sequences that frequently occur together in a frame, and, hence, they can become candidate
concepts. For example:

• Input: Điện thoại Iphone 4 màu trắng chưa được sản xuất

• Meaning: White Iphone 4 has not been produced yet

• Chunks: [Điện thoại Iphone 4 màu trắng]NP [chưa được sản xuất]VP

Assuming that one of the frequent sequences is “Iphone 4,” we can see the candidate
concept completely belongs to the chunk [Điện thoại Iphone 4 màu trắng]NP.

After the mining stage, we only use maximal frequent sequences as candidate phrases. The
set of frequent sequences is denoted by S_F.
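As a rough illustration of this step, the following sketch mines frequent word sequences from chunks with a compact PrefixSpan-style recursion and then keeps only the maximal ones. It is not the implementation used in the system; the function names are hypothetical, each chunk is assumed to be a list of word tokens, and support is counted as the number of chunks containing the sequence.

```python
def prefixspan(sequences, min_support):
    """Return [(pattern, support)] for frequent subsequences of single items."""
    results = []

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the suffixes
        # that remain after matching the current prefix.
        counts = {}
        for sid, pos in projected:
            for item in set(sequences[sid][pos:]):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + [item]
            results.append((tuple(pattern), count))
            # Project each suffix past the first occurrence of the item.
            next_projected = []
            for sid, pos in projected:
                seq = sequences[sid]
                if item in seq[pos:]:
                    next_projected.append((sid, seq.index(item, pos) + 1))
            mine(pattern, next_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results

def is_subsequence(small, big):
    """True if small appears in big in order (not necessarily contiguously)."""
    remaining = iter(big)
    return all(item in remaining for item in small)

def maximal(patterns):
    """Keep patterns that are not a subsequence of another frequent pattern."""
    pats = [p for p, _ in patterns]
    return [p for p in pats if not any(p != q and is_subsequence(p, q) for q in pats)]

chunks = [
    ["điện_thoại", "iphone", "4", "màu", "trắng"],
    ["iphone", "4", "ra_mắt"],
    ["điện_thoại", "iphone", "4"],
]
frequent = prefixspan(chunks, min_support=3)
s_f = maximal(frequent)
```

With a minimum support of 3, the single words "iphone" and "4" are frequent, but only the longer sequence ("iphone", "4") survives the maximality filter.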

Concept identification. 

By executing the above candidate concept discovery processes, we develop a list of
candidate terms and phrases for concept identification. We need a filter mechanism to select
relevant concepts for the ontology. The filter algorithm aims to merge three sets of
candidates into a unique set of concepts, in which the candidates with lower TF-IDF scores are
removed. The steps of the concept identification algorithm are shown in Figure 4.


Input:

    S_TFIDF: Set of candidate terms being selected based on TF-IDF

    S_P: Set of candidate phrases and proper nouns resulted by using lexical rules

    S_F: Set of maximal frequent sequences resulted by using PrefixSpan

Output: Set of concepts C

// filter by using frequent patterns

for (every sequence p_j ∈ S_P)

    if ∃ f_k ∈ S_F, p_j = f_k    // p_j is a frequent pattern

        C ← p_j    // add p_j into set C

// filter by using TF-IDF threshold
for (every sequence s_j ∈ C)

    // for every term of sequence s_j having a TF-IDF score less than the threshold
    if ∃ term t_i ∈ s_j and ∀ term t_k ∈ S_TFIDF, t_i ≠ t_k
        remove s_j from C

    endif

endfor
return C

Figure 4. Concept identification algorithm. 
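One possible reading of the algorithm in Figure 4 is sketched below. The function name and the data representation are assumptions: phrases are tuples of terms, S_TFIDF is a set of terms, and a candidate is dropped as soon as one of its terms is missing from S_TFIDF.

```python
def identify_concepts(s_tfidf, s_p, s_f):
    """Merge the three candidate sets into the concept set C.

    s_tfidf: set of terms that passed the TF-IDF threshold
    s_p:     set of phrases (tuples of terms) found by lexical rules
    s_f:     set of maximal frequent sequences (tuples of terms)
    """
    # Step 1: keep lexical-rule candidates that are also frequent sequences.
    concepts = {phrase for phrase in s_p if phrase in s_f}
    # Step 2: remove any concept containing a term outside S_TFIDF.
    return {c for c in concepts if all(term in s_tfidf for term in c)}

s_tfidf = {"màn_hình", "cảm_ứng", "bàn_phím"}
s_p = {("màn_hình", "cảm_ứng"), ("bàn_phím", "mới"), ("pin",)}
s_f = {("màn_hình", "cảm_ứng"), ("bàn_phím", "mới")}
concepts = identify_concepts(s_tfidf, s_p, s_f)
```

In this example ("pin",) is dropped because it is not a frequent sequence, and ("bàn_phím", "mới") is dropped because "mới" did not pass the TF-IDF threshold.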


Conceptual Relation Discovery 

In this phase, we use the combination of pattern-based and association-based learning to
discover the relation between concepts. Using lexical rules in conceptual relation discovery
has high reliability since lexical rules were predefined by humans based on linguistic rules;
however, predefined grammatical rules may not cover all cases of language usage. We use
the advantage of association rule mining to overcome this weakness, in which relations
between concepts are mined without considering the semantic aspect of sentences.

Lexical pattern-based conceptual relation discovery. 

In this process, we take into account the semantic relations between elements in sentences, 
as used by Nguyen and Phan (2009). According to Vietnamese grammatical characteristics, 
some rules of relations between nouns or noun phrases are as follows: 

• Rule 1: {Noun phrase A} “là một” {Noun phrase B} --> A is a B

• Rule 2: {Noun phrase A} {Proper noun B} --> B is an instance of A

• Rule 3: {Noun phrase A} “có” {Noun phrase B} --> A has a B

• Rule 4: {Noun phrase A} “của” {Noun phrase B} --> B has an A

• Rule 5: {Noun phrase A} “thuộc” {Noun phrase B} --> A is a subclass of B

• Rule 6: {Noun phrase A} “bao gồm” {Noun phrase B},{Noun phrase C} --> B and C belong to A

• Rule 7: {Noun phrase A} (“và” | “hoặc”) {Noun phrase B} --> A (and | or) B

Based on these rules, we build a set of extraction rules using JAPE grammar. When the 
matching process is invoked, matched concept pairs and the relations between them will 
be discovered. In this research, we focus on finding subsumption relations and instances 
of concepts. The set of lexical rules contains many language usage cases that imply isA and 
hasA relations. Building a complete set of rules is not feasible; however, the rule sets can be 
enriched in further study. 
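To make the rule format concrete, the sketch below applies Rules 1, 3, 4, and 5 to a sentence that has already been split into labeled chunks. It is a simplified stand-in for the JAPE grammar, which is not reproduced here; the chunk representation, the relation labels, and the function name are assumptions for illustration.

```python
# Connector word -> function building a (subject, relation, object) triple.
CONNECTOR_RULES = {
    "là một": lambda a, b: (a, "isA", b),         # Rule 1: A is a B
    "có": lambda a, b: (a, "hasA", b),            # Rule 3: A has a B
    "của": lambda a, b: (b, "hasA", a),           # Rule 4: B has an A
    "thuộc": lambda a, b: (a, "subClassOf", b),   # Rule 5: A is a subclass of B
}

def extract_relations(chunks):
    """Scan (text, label) chunks for NP connector NP windows and emit triples."""
    relations = []
    for i in range(len(chunks) - 2):
        (a, label_a), (connector, _), (b, label_b) = chunks[i:i + 3]
        if label_a == "NP" and label_b == "NP" and connector in CONNECTOR_RULES:
            relations.append(CONNECTOR_RULES[connector](a, b))
    return relations

is_a = extract_relations([("iPhone", "NP"), ("là một", "O"), ("điện thoại", "NP")])
has_a = extract_relations([("màn hình", "NP"), ("của", "O"), ("điện thoại", "NP")])
```

Rules 2, 6, and 7 need different window shapes (no connector, a list of noun phrases, or a coordination) and are left out of this sketch.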

Heuristic for Concept and Conceptual Relation Discovery 

Context implication. 

If A is a concept and B appears with A in the context {A} (“và” | “hoặc”) {B}, we can infer
that 1) B is also a concept and 2) A and B have the same level of abstraction. For example:

• Điện thoại HTC Hero và Motorola Milestone đều được cài đặt hệ điều hành Android
2.1.

• Both HTC Hero and Motorola Milestone are installed with the operating system
Android 2.1.

In this context, if we already know HTC Hero is a concept being recognized as a kind of
mobile phone, we can easily infer that Motorola Milestone is also a kind of mobile phone.

Incremental learning. 

Each domain has some cornerstone concepts, called seed concepts, which occur
in many documents within the corpus. The learning phase should start from these seed
concepts to discover concepts at a lower level of abstraction and repeat the process in an
incremental manner.

For example, in the mobile phone domain, we can start with some commonly used
concepts such as mobile phone, keyboard, screen, and operating system. In Vietnamese,
the name of a child-concept often begins with the name of its parent-concept. We define the
prefix-based concept and conceptual relation discovery algorithm below:

Given a concept C_seed as the seed concept selected by a user, if a concept G begins with C_seed,
G might also be a relevant concept that should be selected by the user and G is a
child-concept of C_seed (in the ontology, G becomes a subclass of the class C_seed).

By executing this algorithm on seed concepts, we can incrementally obtain a tree of
concepts. This tree can be used as a part of the concept hierarchy for the ontology containing
“isA” relations between child-concepts and their parents. An example of using this approach
is shown in Table 2.

Table 2

An Example of Child-Concepts Generalization using Seed Concepts and Prefix-Based
Concept Discovery Algorithm

Seed concepts   Level-1 child concepts   Level-2 child concepts        Meaning

màn hình                                                               screen
                màn hình cảm ứng                                       touch screen
                                         màn hình cảm ứng điện trở     resistive touch screen
                                         màn hình cảm ứng điện dung    capacitive touch screen
bàn phím                                                               keyboard
                bàn phím QWERTY                                        QWERTY keyboard
                bàn phím cảm ứng                                       touch keyboard
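The prefix-based discovery rule can be sketched as a small tree builder. This is a minimal illustration under stated assumptions: concepts are plain strings, a child must start with its parent's name followed by a space, and the function name is hypothetical.

```python
def build_concept_tree(seed, candidates):
    """Return {concept: [direct children]} for all candidates prefixed by seed."""
    # Shorter names are placed first so that parents exist before their children.
    related = sorted({c for c in candidates if c.startswith(seed + " ")}, key=len)
    tree = {seed: []}
    for concept in related:
        # The parent is the longest already-placed concept that prefixes this one.
        parent = max((p for p in tree if concept.startswith(p + " ")), key=len)
        tree[parent].append(concept)
        tree[concept] = []
    return tree

candidates = [
    "màn hình cảm ứng",
    "màn hình cảm ứng điện trở",
    "màn hình cảm ứng điện dung",
    "bàn phím QWERTY",
]
tree = build_concept_tree("màn hình", candidates)
```

Each parent-child edge in the resulting dictionary corresponds to one “isA” relation; in the system described here, the user still confirms which generated concepts are kept.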


Learning from instance. 

Typically, a class name rarely co-occurs with its subclasses or its properties in a sentence.
Instead, instances of that class usually appear together with its related concepts. For
example:

• HTC Hero được trang bị màn hình cảm ứng điện dung 4.3 inches và bộ nhớ trong
1.5GB.

• HTC Hero is equipped with a capacitive touch screen 4.3 inches and internal
memory 1.5GB.

If we already know the class of the instance, we generalize the abstraction of the instance by
replacing it with its class and obtain the relation between the class and discovered concepts.
In this example, mobile phone will have a “hasA” relation with screen and internal memory.

Association Rule-Based Conceptual Relation Discovery 

Frequent sequential pattern mining can help in cases when lexical patterns cannot be
applied. We use association rule mining to find hidden (anonymous) relations between
concepts by taking into account their co-occurrence in contexts, both on a sentence level and
document level.

An association rule reflects an implication between its two sides. Let T = {t_i | i = 1, 2, ...,
n} denote a set of transactions, where each transaction is a list of items, and let I = {i_1, i_2, ..., i_m}
denote the set of items. An association rule “A implies B” states that A associates with B,
where A and B are itemsets drawn from I, and the intersection of A and B is empty. A rule “A implies B”
indicates that the appearance of A is followed by the appearance of B with an acceptable
probability. The reliability of a rule is expressed by two measures: support and confidence.




That is, support is the probability to see both A and B appear in the same transaction while 
confidence is the probability to see the consequence B when the antecedent A appears in a 
transaction. 
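Following this description, the two measures correspond to the standard definitions support(A implies B) = P(A ∪ B) and confidence(A implies B) = P(B | A) = support(A ∪ B) / support(A). A small sketch, with hypothetical function names and each transaction assumed to be a set of items, shows how they are computed:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent); antecedent must occur at least once."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

transactions = [
    {"điện thoại", "màn hình"},
    {"điện thoại", "bàn phím"},
    {"điện thoại", "màn hình", "pin"},
    {"pin"},
]
sup = support(transactions, {"điện thoại", "màn hình"})
conf = confidence(transactions, {"điện thoại"}, {"màn hình"})
```

Here the pair co-occurs in two of four transactions (support 0.5), and "màn hình" appears in two of the three transactions containing "điện thoại" (confidence 2/3).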

At the sentence level, we aim to find concept pairs that often appear together in a sentence.
We consider each sentence as a transaction where each term is an item. If a concept is
discovered in a sentence, terms that belong to the concept will be merged into one item. The
result of the association rule mining stage will be a pair of concepts (C_A, C_B) with support
and confidence exceeding predefined thresholds. In addition, we also want to find the
verb that connects these two concepts, so the result is in the form of (C_A, C_B, Verb_con).

At the document level, we consider each sentence as a transaction where items are concepts
that appear in it. The assumption under this mining stage is that two concepts occurring in
different sentences may have a relation between them. The results are concept pairs (C_X,
C_Y) that may not co-occur in one sentence. Results of association rule mining, both at the
sentence level and document level, are presented to users as pairs of concepts. Users will
make their own judgment on the final selection.
Experimental Results


To evaluate the performance of our work, we built a real Vietnamese ontology for the
mobile phone domain as a base for comparison, which is called the reference ontology, O_R. It was
manually built by using the source documents from the online technical specifications of
various smartphones like iPhone, HTC Hero, Motorola Milestone, Samsung Galaxy, and so
on.

Applying our semi-automatic approach to build an ontology, called the computer learned
ontology O_C, we used Vietnamese news on many kinds of mobile phones from 2009 to 2010 as
the corpus. The corpus includes about 500 news articles from Vietnamese Web sites covering
such topics as the arrival of new phones, comparisons of various phones, sharing phone
usage experiences, symptoms of phone problems, and their advantages and disadvantages.
To obtain the input corpus, we used a crawler to download the entire subfolders of three
source Web sites. Web pages published before 2009 were not included. Then we manually
selected HTML pages that contain well-known phone brands such as iPhone, Nokia, HTC,
Samsung, Motorola, Acer, and Sony Ericsson. The final collection of the HTML pages was
used as input documents.

Firstly, we used HTMLParser to extract contents from the HTML files to generate text files.
The text files were tokenized, POS tagged, and segmented using Vietnamese processing
tools: vnToolkit and VLSP tools. To make the Vietnamese text documents suitable for
concept extraction with GATE, we developed a converter to convert them into annotated
documents that can be used by GATE. The annotation sets contain POS labels and chunk labels
of tokens.

We constructed a set of Vietnamese lexical rules using JAPE grammar for pattern-based
discovery. PrefixSpan is used for mining frequent sequential patterns and association rule-based
discovery. The pattern-based concept extraction is executed by the GATE transducer based on
a set of lexical patterns. Results of this stage are recorded as annotations in the annotation
set of the documents to be used as input for conceptual relation extraction. The final
concept set and conceptual relation set are proposed to the users for manual selection.
Selected objects and relations are exported to an OWL model by the Jena toolkit.

Finally, to evaluate the performance of concept discovery algorithms, we compute the term
precision and term recall scores on the comparison between the ontologies O_C and O_R. To
evaluate how well relations were learned, we used the measures of taxonomic precision and
taxonomic recall. These scores were acquired by comparing the concept hierarchies of the
two ontologies based on their position of common concepts.

Evaluation of Concept Extraction 

We adopted the most commonly used measurements for information retrieval, term precision
and term recall, to measure the performance of concept extraction methods. These
measures are computed based on the overlap between the set of concepts in the reference
ontology O_R and the computer learned ontology O_C. Let ST be the term set of the reference
ontology and T be the term set discovered by our method. We have:

|5rnr| 

Term Precision = —;—;— 

in 

|5TnT| 

T erm Recall — —;-;— 

|5T| 

We also use the $F_\beta$-measure, a weighted combination of precision and recall, to measure
the overall performance. It is computed by:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

Here $\beta$ is the weight of recall relative to precision. Since the purpose of our system is
to extract and present as many materials as possible to users, we select $F_2$, which weights
recall twice as much as precision, to reflect the requirement of completeness of the constructed
ontology. With this weight, the more extracted concepts satisfy user requirements,
the better the score:

$$F_2 = \frac{(1 + 2^2) \cdot \text{Precision} \cdot \text{Recall}}{2^2 \cdot \text{Precision} + \text{Recall}}$$

To examine the scalability of the system, we tested the extraction performance with
various corpus sizes. We divided the set of documents into five subsets, as listed in Table 3:
one large dataset, two small datasets, and two medium datasets.

Table 3

Description of Test Document Sets

Subset    Number of files   Total size   Number of contained concepts
Subset1   195               1.37 MB      1412
Subset2   28                249 KB       489
Subset3   34                214 KB       340
Subset4   62                768 KB       696
Subset5   60                628 KB       746


Our system uses two parameters to drive the learning algorithms: the TF-IDF threshold $\delta$
and the minimum support $M$ of frequent sequence mining. Initially, when both parameters
are set to zero, only the lexical pattern-based concept extraction algorithm is executed. When
the TF-IDF threshold or the minimum support is increased, the extracted concepts will be


Vol 13 | No 5 


Research Articles 


December 2012 


filtered by the corresponding parameter. These adjustments affect the number of extracted
concepts: the higher the parameter values, the fewer concepts are extracted. Depending on
the desired size of the constructed ontology, users can adjust these
parameters to enlarge or shrink the set of concepts.
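
The effect of the two parameters can be pictured with the following sketch. It assumes that every candidate concept carries a TF-IDF score and a support count and that a candidate is kept only if it passes both thresholds; this combination rule, the function name `filter_candidates`, and the sample scores are assumptions made for illustration.

```python
def filter_candidates(candidates, delta=0.0, min_support=0):
    """Keep candidates whose TF-IDF score and support reach the thresholds.

    candidates maps a term to a (tfidf, support) pair.
    """
    return {
        term
        for term, (tfidf, support) in candidates.items()
        if tfidf >= delta and support >= min_support
    }


# Hypothetical candidates with made-up scores.
candidates = {
    "screen": (0.030, 5),
    "keyboard": (0.012, 2),
    "price": (0.004, 7),
    "battery": (0.020, 1),
}

everything = filter_candidates(candidates)                        # no filtering
stricter = filter_candidates(candidates, delta=0.01, min_support=2)
print(sorted(everything))
print(sorted(stricter))
```

With both thresholds at zero every candidate is kept, and each increase can only shrink the resulting concept set.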

Table 4 presents our results in detail, where $\delta$ is the TF-IDF threshold, $M$ is the minimum
support, $|T|$ is the number of extracted concepts, $|ST \cap T|$ is the number of relevant extracted
concepts, P is the precision value, R is the recall value, and F is the $F_2$-measure value.
S1 to S5 denote the five subsets of Table 3.

Table 4

The Performance of Concept Extraction

δ      M  Subset  |T|   |ST ∩ T|  P     R     F
0      1  S1      4888  1394      0.29  0.99  0.67
0      1  S2      1270  489       0.40  1.00  0.77
0      1  S3      819   340       0.42  1.00  0.78
0      1  S4      2005  694       0.35  1.00  0.73
0      1  S5      1846  746       0.40  1.00  0.77
0      2  S1      2782  932       0.34  0.66  0.56
0      2  S2      696   311       0.45  0.64  0.59
0      2  S3      480   233       0.49  0.69  0.64
0      2  S4      1256  500       0.40  0.72  0.62
0      2  S5      1846  746       0.40  1.00  0.77
0      4  S1      1239  492       0.40  0.35  0.36
0      4  S2      294   151       0.51  0.31  0.34
0      4  S3      241   137       0.57  0.40  0.43
0      4  S4      597   295       0.49  0.42  0.43
0      4  S5      542   290       0.54  0.39  0.41
0.005  1  S1      4888  1394      0.29  0.99  0.67
0.005  1  S2      1254  489       0.39  1.00  0.76
0.005  1  S3      809   337       0.42  0.99  0.78
0.005  1  S4      1859  658       0.35  0.95  0.71
0.005  1  S5      1768  722       0.41  0.97  0.76
0.005  2  S1      2782  932       0.34  0.66  0.56
0.005  2  S2      686   311       0.46  0.63  0.59
0.005  2  S3      478   232       0.49  0.68  0.63
0.005  2  S4      1197  490       0.41  0.70  0.61
0.005  2  S5      1209  553       0.46  0.74  0.66
0.005  3  S1      1524  586       0.39  0.42  0.41
0.005  3  S2      363   183       0.51  0.37  0.39
0.005  3  S3      298   166       0.56  0.49  0.50
0.005  3  S4      720   336       0.47  0.48  0.48
0.005  3  S5      652   334       0.51  0.45  0.46
0.01   1  S1      4844  1384      0.28  0.98  0.65
0.01   1  S2      1042  444       0.43  0.91  0.74
0.01   1  S3      767   322       0.42  0.95  0.76
0.01   1  S4      1438  532       0.37  0.77  0.63
0.01   1  S5      1647  680       0.41  0.91  0.73
0.01   2  S1      2781  932       0.34  0.66  0.56
0.01   2  S2      646   304       0.47  0.62  0.58
0.01   2  S3      455   220       0.48  0.65  0.61
0.01   2  S4      978   412       0.42  0.59  0.55
0.01   2  S5      1130  529       0.47  0.71  0.64
0.01   4  S1      1238  492       0.39  0.34  0.35
0.01   4  S2      274   145       0.53  0.30  0.33
0.01   4  S3      235   132       0.56  0.39  0.42
0.01   4  S4      507   259       0.51  0.37  0.39
0.01   4  S5      525   285       0.54  0.38  0.40
0.02   1  S1      3773  1110      0.30  0.79  0.60
0.02   1  S2      465   216       0.46  0.44  0.44
0.02   1  S3      486   200       0.41  0.59  0.54
0.02   1  S4      683   266       0.39  0.38  0.38
0.02   1  S5      943   401       0.43  0.54  0.51
0.02   2  S1      2481  862       0.34  0.61  0.53
0.02   2  S2      335   186       0.56  0.38  0.41
0.02   2  S3      289   144       0.50  0.42  0.43
0.02   2  S4      494   222       0.45  0.32  0.34
0.02   2  S5      728   348       0.48  0.47  0.47
0.02   3  S1      1365  541       0.40  0.38  0.38
0.02   3  S2      198   116       0.59  0.24  0.27
0.02   3  S3      188   107       0.57  0.31  0.34
0.02   3  S4      317   166       0.52  0.24  0.27
0.02   3  S5      406   212       0.52  0.28  0.31

The extraction results show that the lexical pattern-based approach produces high recall
but relatively low precision. Increasing the TF-IDF threshold and the minimum support
(minsup) improves precision at the cost of lower recall. Figure 5 shows the scalability of
our system with respect to corpus size when $\delta = 0.005$ and $M = 2$. In our test cases, the
system performed best on a medium-sized corpus; increasing the corpus size can
degrade the extraction performance. Overall, after testing with various sets of input
documents, the extraction performance was around 60% when appropriate thresholds
were specified.



Figure 5. Extraction performance with respect to corpus size. 

Evaluation of Conceptual Relation Extraction 

As our purpose in the relation extraction step is to find subsumption relations, the lexical
rules are built to discover patterns that contain “isA” and “hasA” relations. From the set of
relations found by association rule-based extraction, we only used rules that imply “isA” and
“hasA” relations. These relations are used to construct the concept hierarchy of the
ontology. Some of the top extracted relations are shown in Table 5.

Table 5

Top Extracted Relations

Relations                                            English meaning
điện_thoại - hasA - màn_hình                         phone - hasA - screen
điện_thoại - hasA - bàn_phím                         phone - hasA - keyboard
màn_hình - hasA - độ_phân_giải                       screen - hasA - resolution
điện_thoại - hasA - hệ_điều_hành                     phone - hasA - operating_system
Android - isA - hệ_điều_hành                         Android - isA - operating_system
màn_hình_cảm_ứng - isA - màn_hình                    touch_screen - isA - screen
màn_hình_cảm_ứng_điện_dung - isA - màn_hình_cảm_ứng  capacitive_touch_screen - isA - touch_screen
bàn_phím_QWERTY - isA - bàn_phím                     QWERTY_keyboard - isA - keyboard
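
As an illustration of how such relations can be arranged into a concept hierarchy, the sketch below stores the English versions of the relations in Table 5 as triples and groups them by relation type. The data structure and the function name `build_hierarchy` are assumptions for the example and do not describe the internal representation of our system.

```python
from collections import defaultdict

# English versions of the relations listed in Table 5.
triples = [
    ("phone", "hasA", "screen"),
    ("phone", "hasA", "keyboard"),
    ("screen", "hasA", "resolution"),
    ("phone", "hasA", "operating_system"),
    ("Android", "isA", "operating_system"),
    ("touch_screen", "isA", "screen"),
    ("capacitive_touch_screen", "isA", "touch_screen"),
    ("QWERTY_keyboard", "isA", "keyboard"),
]


def build_hierarchy(triples):
    """Group triples into isA children and hasA parts per concept."""
    children = defaultdict(set)  # super-concept -> sub-concepts (isA)
    parts = defaultdict(set)     # whole -> parts (hasA)
    for subject, relation, obj in triples:
        if relation == "isA":
            children[obj].add(subject)
        elif relation == "hasA":
            parts[subject].add(obj)
    return children, parts


children, parts = build_hierarchy(triples)
print(sorted(parts["phone"]))
print(sorted(children["screen"]))
```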


Figure 6 illustrates some of the top extracted conceptual relations in the computer-learned
ontology, translated into English.


[Figure: concept graph with the nodes smartphone, keyboard, operating_system, touchscreen, and capacitive_touchscreen, connected by hasA and isA edges.]

Figure 6. Some learned top level relations.


To evaluate how well the concept hierarchy was constructed in the ontology, we
used taxonomic precision (TP) and taxonomic recall (TR) as proposed by Dellschaft and
Staab (2003), in which the position of a concept in the learned hierarchy is compared with
the position of the same concept in the reference hierarchy. TP and TR are computed from the
common semantic cotopy (csc), which measures the taxonomic overlap of two ontologies. The
common semantic cotopy excludes all concepts that are not also available in the other
ontology’s concept set. Given a concept $c$ and two ontologies $O_1$ and $O_2$, the common
semantic cotopy (csc) is defined as follows:

$$csc(c, O_1, O_2) := \{ c_i \mid c_i \in C_1 \cap C_2 \wedge (c_i <_1 c \vee c <_1 c_i) \}$$

where $C_1$ and $C_2$ are the sets of concepts of ontologies $O_1$ and $O_2$, respectively, and
$<_1$ denotes the taxonomic (sub-concept) order of $O_1$.

TP and TR are computed based on the common semantic cotopy as follows:

$$TP_{csc}(O_1, O_2) := \frac{1}{|C_1 \cap C_2|} \sum_{c \in C_1 \cap C_2} tp_{csc}(c, c, O_1, O_2)$$

$$TR_{csc}(O_1, O_2) := TP_{csc}(O_2, O_1)$$

where $tp_{csc}(c, c, O_1, O_2)$ is the local precision on the common semantic cotopy of the
concept $c$, computed by:

$$tp_{csc}(c, c, O_1, O_2) := \frac{|csc(c, O_1, O_2) \cap csc(c, O_2, O_1)|}{|csc(c, O_1, O_2)|}$$

Based on TP and TR, we can compute the taxonomic F-measure as:

$$TF_{csc}(O_1, O_2) := \frac{2 \cdot TP_{csc}(O_1, O_2) \cdot TR_{csc}(O_1, O_2)}{TP_{csc}(O_1, O_2) + TR_{csc}(O_1, O_2)}$$
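
The taxonomic measures can be sketched as follows. The sketch assumes that each ontology is a tree given as a child-to-parent map with the root mapped to `None`, and it treats the local precision of a concept with an empty cotopy as 1.0; both choices, as well as all function names and the two toy ontologies, are assumptions made for illustration.

```python
def ancestors(parent, concept):
    """Strict ancestors of a concept in a child -> parent map."""
    found = set()
    while parent.get(concept) is not None:
        concept = parent[concept]
        found.add(concept)
    return found


def csc(concept, onto1, onto2):
    """Common semantic cotopy of a concept in onto1 with respect to onto2."""
    common = set(onto1) & set(onto2)
    descendants = {x for x in onto1 if concept in ancestors(onto1, x)}
    return (ancestors(onto1, concept) | descendants) & common


def local_tp(concept, onto1, onto2):
    own = csc(concept, onto1, onto2)
    other = csc(concept, onto2, onto1)
    # Assumption: an empty cotopy counts as fully precise.
    return len(own & other) / len(own) if own else 1.0


def taxonomic_precision(onto1, onto2):
    common = set(onto1) & set(onto2)
    return sum(local_tp(c, onto1, onto2) for c in common) / len(common)


def taxonomic_recall(onto1, onto2):
    return taxonomic_precision(onto2, onto1)


def taxonomic_f(onto1, onto2):
    tp = taxonomic_precision(onto1, onto2)
    tr = taxonomic_recall(onto1, onto2)
    return 2 * tp * tr / (tp + tr) if tp + tr else 0.0


# Toy ontologies: the learned one places touch_screen directly under thing.
learned = {"thing": None, "screen": "thing", "touch_screen": "thing", "keyboard": "thing"}
reference = {"thing": None, "screen": "thing", "touch_screen": "screen", "keyboard": "thing"}

tp = taxonomic_precision(learned, reference)
tr = taxonomic_recall(learned, reference)
tf = taxonomic_f(learned, reference)
print(tp, tr, round(tf, 3))
```

In this toy case every relation in the learned tree also holds in the reference tree, so precision is perfect, while the missing link between screen and touch_screen lowers recall.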


Given two ontologies $O_1$ and $O_2$, in which $O_1$ is the computer-learned ontology and $O_2$ is
the reference (or standard) ontology, a part of the evaluation is shown in Table 6. We
consider only the parts of the two ontologies in which a concept $c$ has a different position
in $O_1$ than in $O_2$. A number of concepts are not considered in this evaluation since
they are leaf concepts linked to root nodes (things) in both ontologies. There is no need to
find position differences for these concepts.

Table 6

Comparison of the Learned Ontology and Reference Ontology

c   csc(c, O_1, O_2)   csc(c, O_2, O_1)
a   root, b, c         root, b, c, d
b   root, a            root, a
c   root, a            root, a
d   (empty)            root, a
e   root               root
f   root, h            root, h
g   f                  f
h   f                  f
i   (empty)            root, j, k, l, m, n
j   root               root, i
k   root               root, i
k   root               root, i
l   root               root, i
m   root               root, i
m   root               root, i
o   (empty)            root, p, q, r, s, t
p   root               root, o
q   root               root, o
r   root               root, o
s   root               root, o
t   root               root, o

Description: a: file format; b: music file; c: video file; d: other file; e: smartphone; f: network;
g: 2G network; h: 3G network; i: phone software; j: operating system; k: applications; l: music
player; m: email; n: call feature; o: phone hardware; p: keyboard; q: memory; r: screen;
s: battery; t: earphone.


Based on the analysis shown in Table 6, we can compute $TP_{csc}(O_1, O_2) = 100\%$,
$TR_{csc}(O_1, O_2) = TP_{csc}(O_2, O_1) = 68.05\%$, and $TF_{csc}(O_1, O_2) = 0.8$.
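
As a check, the recall value can be traced back to the rows of Table 6. The calculation below assumes that the average runs over the 18 rows whose cotopy in the learned ontology is non-empty, so the rows for d, i, and o are left out; this reading is an assumption, but it reproduces the reported figure. The row for a contributes 3/4, the six rows for b, c, e, f, g, and h contribute 1 each, and the eleven rows whose cotopy is {root} in the learned ontology but contains one more concept in the reference ontology contribute 1/2 each. Precision is 100% because every learned cotopy in these rows is contained in the corresponding reference cotopy.

```latex
\begin{align*}
TR_{csc}(O_1, O_2) &= \frac{1}{18}\left(\frac{3}{4} + 6 \cdot 1 + 11 \cdot \frac{1}{2}\right)
                    = \frac{12.25}{18} = 0.6805\ldots \\
TF_{csc}(O_1, O_2) &= \frac{2 \cdot 1 \cdot 0.6805}{1 + 0.6805} \approx 0.81
\end{align*}
```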

Usage of ViText2Onto in the Education Domain

Using ViText2Onto, we built an ontology in the education domain to show its application in
a real-world project. The purpose of this project is to build a course recommender system
for students of the Information Technology Department at Tra Vinh University.
The recommender system takes a student profile as input and outputs a list of recommended
courses. A student profile is created from the courses the student has taken so far. The
knowledge base of the system is an ontology containing information about all courses of
the school’s bachelor program in information technology.

The ontology was built from the descriptions of 50 courses, with each course description
stored in one text document. In each document, a course is presented as a set of units
of knowledge that students must learn, and each unit corresponds to a learning objective. Each
document includes the course name, a list of learning objectives, a list of knowledge units,
a list of chapter titles, and the course schedule. These documents are available in the school’s
e-learning system before the beginning of each semester.

We used ViText2Onto to extract as many course names and learning objective names as
possible as concepts for the ontology. Due to the structure of the source documents, each
document contains only a plain list of the learning objectives that belong to the corresponding
course, and each learning objective is expressed as a noun phrase rather than as a
paragraph of full sentences, as in news or other types of documents. We therefore did not use the
relation extraction feature for this ontology. Instead, after extracting the concepts from the
documents, we placed them in the ontology manually, following a semi-automatic approach.

In our ontology, the concepts that belong to a course form a concept tree. The concept trees of
all courses in the program form the structure of the ontology. Using ViText2Onto, we were
able to extract 60% of the concepts used in the ontology. For example, Figure 7 illustrates a
concept tree for the course Introduction to C Programming.

Figure 7. Concept tree for the course Introduction to C Programming.

From a general perspective, the performance of the proposed system is acceptable for
supporting users in constructing a Vietnamese ontology: the semi-automatic concept extraction
method significantly reduces labor cost and time consumption.
The accuracy of the system reaches above 50% on our testing datasets. Further studies
are under way to improve the extraction phase, and we believe the
overall performance can be improved.


Conclusion and Future Research 


In this research, we proposed a support system for Vietnamese ontology construction that
combines lexical pattern-based, statistics-based, and frequent sequence pattern-based
methods. The integrated approach can overcome the weaknesses of each individual
method, which may lead to missing concepts and relations in the discovery task. We also
built a real Vietnamese ontology in the mobile phone domain using our proposed system
and compared it with a manually constructed gold standard ontology. The evaluation
shows that our approach has acceptable performance in concept and relation discovery.

In addition, the constructed ontology can be used as a knowledge base in many applications,
such as recommendation systems, text classification, and information retrieval. Based on
our model, many knowledge bases can be constructed easily, so that more materials become
available in open and distance learning.

In the near future, we would like to further automate ontology construction by automatically
learning the taxonomy part of the ontology from text documents. Alternative, more efficient
concept extraction methods will be considered in order to take the semantic aspect of
documents into account.